Compiling and Processing Historical and Contemporary Portuguese Corpora

Author

  • Marcos Zampieri
Abstract

University of Cologne, Albertus-Magnus Platz, 50923 Cologne, Germany

This technical report describes the framework used to process three large Portuguese corpora. Two of them contain newspaper texts, one published in Brazil and the other in Portugal. The third is Colonia, a historical Portuguese collection containing texts written between the 16th and the early 20th century. The report presents the pre-processing, segmentation, and annotation of the corpora, as well as indexing and querying methods. Finally, it presents published research papers that use the corpora.

Similar resources

Building a Corpus-based Historical Portuguese Dictionary: Challenges and Opportunities

Historical corpora are important resources for different areas. Philology, Human Language Technology, Literary Studies, History, and Lexicography are some that benefit from them. However, compiling historical corpora is different from compiling contemporary corpora. Corpus designers have to deal with several characteristics inherent in historical texts, such as: absence of a spelling standard, ...

Grammatical Annotation of Historical Portuguese: Generating a Corpus-Based Diachronic Dictionary

In this paper, we present an automatic system for the morphosyntactic annotation and lexicographical evaluation of historical Portuguese corpora. Using rule-based orthographical normalization, we were able to apply a standard parser (PALAVRAS) to historical data (Colonia corpus) and to achieve accurate annotation for both POS and syntax. By aligning original and standardized word forms, our met...

Providing Internet Access to Portuguese Corpora: the AC/DC Project

In this paper we report on the activity of the project Computational Processing of Portuguese (Processamento computacional do português) in what concerns providing access to Portuguese corpora through the Internet. One of its activities, the AC/DC project (Acesso a corpora/Disponibilização de Corpora, roughly "Access and Availability of Corpora") allows a user to query around 40 million words o...

Speech Recognition for Brazilian Portuguese using the Spoltech and OGI-22 Corpora

Speech processing is a data-driven technology that relies on public corpora and associated resources. In contrast to languages such as English, there are few resources for Brazilian Portuguese (BP). This work describes efforts toward decreasing such gap and presents systems for speech recognition in BP using two public corpora: Spoltech and OGI-22. The following resources are made available: AT...

Commentary: Portuguese crypto-Jews: the genetic heritage of a complex history

Nogueiro et al. (2015) utilize Y chromosome and mitochondrial genotype data from contemporary Iberian and non-Iberian human populations to explore the genetic identity of Portuguese “crypto-Jews.” In the first section of the paper, a historical introduction reviews the plight of Jews in the Iberian Peninsula from the earliest archaeological evidence, through the Inquisition, to the current da...

Journal title:
  • CoRR

Volume abs/1710.00803

Publication date 2017